2012 Presidential Campaign Donations in Ohio

Aaron Rank


require(ggplot2)
require(arm)
require(cvTools)
require(ggthemes)
require(pROC)
require(coefplot)
require(reshape2)
require(boot)
require(plyr)
require(dplyr)
require(lubridate)
require(glmnet)
require(Hmisc)
require(jsonlite)
require(zoo)
require(scales)
require(stringr)
require(zipcode)
theme_set(theme_minimal(20))
data <- read.csv('~/projects/udacity_DataR/P00000001-OH.csv', sep=',',row.names=NULL,stringsAsFactors = FALSE)

# The dataset had each column name shifted to the right one place with an additional column at the end filled with NAs. 
colnames(data)[1:18] <- colnames(data)[2:19] 
data <- data[1:18]

# create a sample size to be used for testing/training sets later and to avoid data snooping
smp_size <- floor(0.75 * nrow(data))

## set the seed to make your partition reproductible
set.seed(1)

# create index
index <- sample(seq_len(nrow(data)), size = smp_size)

# data for exploratory analysis
data <- data[index, ]

# hold data out to avoid data snooping
held_out_data <- data[-index, ]

dim(data)
## [1] 96993    18
str(data)
## 'data.frame':    96993 obs. of  18 variables:
##  $ cmte_id          : chr  "C00431445" "C00431445" "C00431445" "C00431445" ...
##  $ cand_id          : chr  "P80003338" "P80003338" "P80003338" "P80003338" ...
##  $ cand_nm          : chr  "Obama, Barack" "Obama, Barack" "Obama, Barack" "Obama, Barack" ...
##  $ contbr_nm        : chr  "WATSON, BARBARA" "THOMPSON, ANN" "GABEL, E MARIANNE" "KREUTZER, JACKIE" ...
##  $ contbr_city      : chr  "MENTOR" "CINCINNATI" "DELAWARE" "COLUMBUS" ...
##  $ contbr_st        : chr  "OH" "OH" "OH" "OH" ...
##  $ contbr_zip       : chr  "44060" "45239" "430151620" "43220" ...
##  $ contbr_employer  : chr  "NOT EMPLOYED" "SELF-EMPLOYED" "SELF-EMPLOYED" "RETIRED" ...
##  $ contbr_occupation: chr  "SP. ED TEACHER" "REAL ESTATE APPRAISER" "ATTORNEY" "CLINICAL PSYCHOLOGIST" ...
##  $ contb_receipt_amt: num  5 20 2500 25 450 19 15 25 20 200 ...
##  $ contb_receipt_dt : chr  "09-MAY-12" "24-OCT-12" "30-MAR-12" "02-OCT-12" ...
##  $ receipt_desc     : chr  "" "" "" "" ...
##  $ memo_cd          : chr  "" "X" "" "" ...
##  $ memo_text        : chr  "" "* OBAMA VICTORY FUND 2012" "" "" ...
##  $ form_tp          : chr  "SA17A" "SA18" "SA17A" "SA17A" ...
##  $ file_num         : int  791603 876050 896852 897092 876050 897092 897092 896964 896900 896745 ...
##  $ tran_id          : chr  "C15606983" "C28224513" "C14908631" "C23837302" ...
##  $ election_tp      : chr  "P2012" "G2012" "G2012" "G2012" ...
head(data[1:8])
##          cmte_id   cand_id       cand_nm         contbr_nm contbr_city
## 34337  C00431445 P80003338 Obama, Barack   WATSON, BARBARA      MENTOR
## 48125  C00431445 P80003338 Obama, Barack     THOMPSON, ANN  CINCINNATI
## 74084  C00431445 P80003338 Obama, Barack GABEL, E MARIANNE    DELAWARE
## 117452 C00431445 P80003338 Obama, Barack  KREUTZER, JACKIE    COLUMBUS
## 26082  C00431445 P80003338 Obama, Barack   HARRIS, CHARLES    COLUMBUS
## 116180 C00431445 P80003338 Obama, Barack      DANDAR, MATT      TIFFIN
##        contbr_st contbr_zip contbr_employer
## 34337         OH      44060    NOT EMPLOYED
## 48125         OH      45239   SELF-EMPLOYED
## 74084         OH  430151620   SELF-EMPLOYED
## 117452        OH      43220         RETIRED
## 26082         OH      43204     TIME WARNER
## 116180        OH      44883       HOMEMAKER
head(data[9:18])
##            contbr_occupation contb_receipt_amt contb_receipt_dt
## 34337         SP. ED TEACHER                 5        09-MAY-12
## 48125  REAL ESTATE APPRAISER                20        24-OCT-12
## 74084               ATTORNEY              2500        30-MAR-12
## 117452 CLINICAL PSYCHOLOGIST                25        02-OCT-12
## 26082                MANAGER               450        03-NOV-12
## 116180         SELF-EMPLOYED                19        05-OCT-12
##        receipt_desc memo_cd                 memo_text form_tp file_num
## 34337                                                   SA17A   791603
## 48125                     X * OBAMA VICTORY FUND 2012    SA18   876050
## 74084                                                   SA17A   896852
## 117452                                                  SA17A   897092
## 26082                                                   SA17A   876050
## 116180                                                  SA17A   897092
##          tran_id election_tp
## 34337  C15606983       P2012
## 48125  C28224513       G2012
## 74084  C14908631       G2012
## 117452 C23837302       G2012
## 26082  C29576890       G2012
## 116180 C24029761       G2012
describe(data$cand_nm)
## data$cand_nm 
##       n missing  unique 
##   96993       0      14 
## 
## Bachmann, Michele (349, 0%) 
## Cain, Herman (420, 0%), Gingrich, Newt (1075, 1%) 
## Huntsman, Jon (28, 0%), Johnson, Gary Earl (97, 0%) 
## McCotter, Thaddeus G (6, 0%) 
## Obama, Barack (68428, 71%) 
## Paul, Ron (3184, 3%), Pawlenty, Timothy (46, 0%) 
## Perry, Rick (180, 0%) 
## Roemer, Charles E. 'Buddy' III (112, 0%) 
## Romney, Mitt (21527, 22%) 
## Santorum, Rick (1514, 2%), Stein, Jill (27, 0%)

First Look at the Data

The data contains 19 features which makes it difficult to understand. Ideally, I would like to reduce the number of features to no more than 10. However, what I can tell is that the majority of the conbributions were made to Barack Obama (71%) and Mitt Romney (22%) with the other 12 candidates splitting the remaining 7% of contributions. I can also see that the candidate’s party is not included the dataset - it will be necessary to add this information as it is vital to answering the question in which I’m interested - To which party was the contribution made?

Feature Creation and Addition

# Census data for Ohio
pops <- fromJSON('~/projects/udacity_DataR/ohio_city_populations.json') 

# Add Population of the Contributor's City
getPopulation <- function(data){
  sapply(data$contbr_city,function(x) { 
    ifelse(x %in% names(pops),as.numeric(pops[as.character(x)]),NA)
  }) 
}
data$population <- getPopulation(data)

# Create a feature describing how many days from the election the contribution was made
getElectDelta <- function(data){
  data$contb_receipt_dt <- as.Date(data$contb_receipt_dt, "%d-%b-%y")
  election_day <- as.Date("2012-11-06")
  sapply(data$contb_receipt_dt,function(x) {
    return(election_day - x)
    })
}
data$elect_delta <- getElectDelta(data)

# Identify the Party of each Candidate
getCandParty <- function(data) {
  cand_party = list()    
  for (cand in unique(data$cand_nm)) {
    if (cand == 'Obama, Barack') {
      cand_party['Obama, Barack'] <- 'D'
    }     
    else if (cand == 'Stein, Jill') {
      cand_party['Stein, Jill'] <- 'G'
    } 
    else if (cand == 'Johnson, Gary Earl') {
      cand_party['Johnson, Gary Earl'] <- 'L'
    } 
    else {
      cand_party[as.character(cand)] <- 'R'
    }
  }
  sapply(as.character(data$cand_nm), function(x) {
    as.character(cand_party[x])
    })
}
data$cand_party <- getCandParty(data)

# Zip codes should be strings
data$contbr_zip <- substr(as.character(data$contbr_zip),1,5)

# I'm not concerned with negative donations - I'm not really sure what they mean
data <- subset(data,contb_receipt_amt >0)

write.csv(data,'~/projects/udacity_DataR/data_gender_pred.csv')
# Reload the data after running the python scripts
data <- read.csv('~/projects/udacity_DataR/data_w_salary_gender.csv',stringsAsFactors = FALSE)

# making an assumption that if the contributor's name, city, and employer show up more than once, it is the same person
# this would indicate that they made multiple contributions
getMultipleContb <- function(data){
  rows <- paste(data$contbr_nm, data$contbr_city, data$contbr_employer, sep=" ")  
  ifelse(duplicated(rows) == TRUE, 1, 0)
}

data$multiple_contb <- getMultipleContb(data)

# predicted_gender is a better variable name
data$predicted_gender <- data$gender 

# check to see if the contributor identified their gender in their name
getGender <- function(data){
  sapply(as.character(data), function(name) {
    
    if(grepl("MRS.",name)){
      return("female")
    }
    else if(grepl("MR.",name)){
        return("male")
    }
    else if(grepl(" MS.",name)){
      return("female")
    }
    else{
      return(as.character(NA))
    }
    })  
}

data$gender <- getGender(data$contbr_nm)

# create a feature based on if the contributor included MR. MRS., or MS. in their name
data$included_gender <- ifelse(is.na(data$gender), 0, 1)

# use predicted gender for those contributos that did not include MR., MRS., MS. in their contribution
getFinalGender <- function(data){
  sapply(1:length(data[,'gender']), function(i){
      if (is.na(data[i,'gender'])){
        data[i,'predicted_gender']
      }
      else{
        data[i,'gender']
      }
     
    })  
}
data$predicted_gender <- getFinalGender(data)


# Adds binary feature if the contributor's zip code is within a city
# Zip codes are used because I wanted to include the surrounding areas
add_city <- function(city){
   sapply(data$contbr_zip, function(zip){
    if (substring(as.character(zip),1,5) %in% city){
      1
    }
    else{
      0
    }
    })
}

# http://www.city-data.com allowed me to search for zipcodes in specific cities
cbus <- as.character(c(43002, 43004, 43016, 43017, 43026, 43035, 43054, 43065, 43081, 43082, 43085, 43119, 43123, 43137, 43147, 43201, 43202, 43203, 43204, 43205, 43206, 43207, 43210, 43211, 43212, 43213, 43214, 43215, 43217, 43219, 43220, 43221, 43222, 43223, 43224, 43227, 43228, 43229, 43230, 43231, 43235, 43240))
cleveland <- as.character(c(44101, 44103, 44104, 44105, 44106, 44107, 44111, 44112, 44113, 44114, 44115, 44117, 44119, 44120, 44121, 44125, 44127, 44134))
cincy <- as.character(c(45202, 45203, 45204, 45205, 45206, 45207, 45208, 45209, 45212, 45214, 45216, 45217, 45219, 45220, 45223, 45224, 45225, 45226, 45227, 45229, 45230, 45231, 45232, 45239, 45243))
data$cincy <- add_city(cincy)
data$cbus <- add_city(cbus)
data$cleveland <- add_city(cleveland)

# drop rows without an estimated salary
data <- subset(data, estimated_salary!= 'No Data ') 

# estimated salary was a string
data$estimated_salary <- as.numeric(data$estimated_salary)

# making an assumption that if the contributor's name, city, and employer show up more than once, it is the same person
# this would indicate that they made multiple contributions
getMultipleContb <- function(data){
  rows <- paste(data$contbr_nm, data$contbr_city, data$contbr_employer, sep=" ")  
  ifelse(duplicated(rows) == TRUE, 1, 0)
}
data$multiple_contb <- getMultipleContb(data)

# get additional info about contribution dates
data$weekday <- weekdays(as.Date(data$contb_receipt_dt))
data$month <- month(data$contb_receipt_dt)
data$year <- year(data$contb_receipt_dt)
data$yearmon <- as.yearmon(as.Date(data$contb_receipt_dt))

Contribution Amount

contb.quantile <- quantile(data$contb_receipt_amt, .95)
ggplot(aes(data$contb_receipt_amt), data=data) + geom_density() + xlim(0,contb.quantile)

ggplot(aes(log(data$contb_receipt_amt)), data=data) + geom_histogram()  + scale_x_discrete(limits = c(0:round(log(contb.quantile))))

The contribution amount is not normally distributed, it is closer to a log-normal distribution, and is multi-modal. It appears that the most frequent contribution amounts were less than or equal to $100 with a $100 donation being the most frequent. One can also see that the data contains peaks at specific values such as 100, 250, and 500 - this aspect of the data gives it the feeling of being discrete.

Election Type

## 
##       G2012     P P2012 
##     6 56965     4 47771

election_tp tells in which election the donation was made; namely the Primary or General Election which respectively accounted for 45% and 55% of all donations.

Contributor’s Occupation & Employer

## data$contbr_occupation 
##       n missing  unique 
##  104592     154    4135 
## 
## lowest : 2ND GRADE TEACHER        3D MODELER               401K SALES               4TH GRADE TEACHER        A/P CLERK               
## highest: YOUTH MINISTER           YOUTH MINISTRY           YOUTH SERVICES           YOUTH SERVICES ASSOCIATE ZONE BUSINESS MANAGER
## data$contbr_employer 
##       n missing  unique 
##  104416     330   11403 
## 
## lowest : (SELF) GREEN LEAF LAWN CARE           (SELF) STATE FARM INSURANCE COMPANIES 1 EDI SOURCE, INC                     1 FINANCIAL CORPORATION               1099                                 
## highest: ZOLMET TECHNOLOGIES INC.              ZOO GAMES INC.                        ZUCKERMAN CONSULTING GROUP, INC.      ZUKERMAN DAIKER & LEAR                ZUKERMAN DAIKER LEAR

There are over 6,000 unique occupations and 11,000 unique employers - far too many visually look at.

Contributor’s Population - Newly Created Feature

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      77   13840   37530  173700  297500  822600

Number of Days from the General Election - Newly Created Feature

elect_delta follows the log-normal distribution. This makes sense as American’s are typically apathetic towards politics especially when an election is months or years away. The two campaigns were also likely to have ramped up their efforts to solicit contributions as election day neared.

Candidate’s Party - Newly Created Feature

## data$cand_party 
##       n missing  unique 
##  104746       0       4 
## 
## D (75633, 72%), G (30, 0%), L (100, 0%), R (28983, 28%)

Democratic candidates received 72% of donations as opposed to the Republicans 28%. This is interesting because Ohio is typically considered a battle ground state meaning that the voting population is equally split between the two parties.

Predicted Gender of Contributor - Newly Created Feature

## 
## female   male 
##  51339  53407

Included Gender - Did the Contributor Indicate their Gender?

## 
##     0     1 
## 89410 15336

Out of the over 100,000 records, only about 15,000 records exist where the contributor provided an indication of their gender by using “MS.”, “MRS.”, or “MR.”. Using the way back machine, I looked at the Obama and Romney donation page on their website - neither seemed to ask for gender nor have a form for the donor to identify their gender.

Predicting Gender

Because so few contributors I need to predict the gender of the contributors that did not include MR., MRS., or MS. I attempted to use the gender package in R, but it was very slow (it never completed the task, but it had been running over 8 hours when I finally stopped it). As such, I decided to use Python’s NLTK library for the task and created a script called classify_gender.py. Running classify_gender.py creates a new csv file (data_gender_predicted.csv) with predicted gender names and prints “You classified 0.7704 correct on the test set” to stdout. After running both python scripts, I loaded the results back into R.

Contributor’s Estimated Salary - Newly Created Feature

Predicting Salaries

Trying to use all of the information provided in the original file and thinking that a person’s salary could have predictive power, I wrote a web scraping script, get_estimated_salaries.py, in Python that gets salary information for different occupations. Indeed.com allows you to enter an occupation and zip code and returns an average salary.

Multiple Contributions - Newly Created Feature

## 
##     0     1 
## 32290 72456

I made the assumption that if a contributor’s name, city, and employer show up more than once, it is the same person and this would indicate that they made multiple contributions.

First Thoughts

I am interested to which political party the contribution was made as I will attempt to build a model to predict this. I believe some of the relevant features will be the amount contributed, when the contribution was made, the gender of the contributor, the location of the contributor, the salary of the contributor, and if the contributor made multiple contributions. It would be nice to know the age of the contributor, but this information is difficult to acquire.

Bivariate Plots

data <- subset(data, cand_party == 'R' | cand_party == 'D' )
groupData <- function(initial_data,...){
# helper function to make frequent grouping of data with diffrent variables easier
gp_data <-group_by(initial_data,...)
gps_data<-summarise(gp_data, 
                  mean_contb = mean(contb_receipt_amt),
                  median_contb = median(contb_receipt_amt),
                  sum_contb = sum(contb_receipt_amt),
                  mean_pop = mean(population),
                  median_pop = median(population),
                  median_elect_delta = median(elect_delta),
                  mean_elect_delta = mean(elect_delta),
                  count = n()) 
gps_data
}
#write.csv(data, 'temp_data.csv')
#data <- read.csv('temp_data.csv', stringsAsFactors = FALSE)
data.by.party <- groupData(data[,1:33],cand_party)

# Comparison of contribution amount between parties
ggplot(data,aes(cand_party, log(contb_receipt_amt+1))) + geom_boxplot()

# Distribution of when contributions were made by party
ggplot(data,aes(elect_delta)) + geom_histogram(aes(fill=cand_party))

# Compares estimated salary of donors between both parties
ggplot(data, aes(cand_party,log(estimated_salary+10))) + geom_boxplot()

# Count of men & women donors by party
ggplot(data, aes(x=predicted_gender)) + geom_bar() + facet_wrap(~cand_party)

# Comparison of the distribution of contributions by weekday between parties
data$weekday <- factor(data$weekday, levels= c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

ggplot(data[order(data$weekday),],aes(weekday)) + geom_bar() + facet_wrap(~ cand_party) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(data[order(data$weekday),],aes(weekday, contb_receipt_amt)) + facet_wrap(~ cand_party)  + stat_summary(fun.y = mean, geom="bar") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

###Do people that include their gender donate to D or R?
data.cand_party.include_gender <- groupData(data[,1:33],cand_party, included_gender)
ggplot(data.cand_party.include_gender,aes(included_gender, count)) + geom_bar(stat='identity') + facet_wrap(~ cand_party)  + scale_x_discrete(breaks = c(0,1))

table(data[c('cand_party', 'included_gender')])
##           included_gender
## cand_party     0     1
##          D 75497   136
##          R 13783 15200
# Comparison of population betwen parties
ggplot(data, aes(cand_party, population)) + geom_boxplot() 

# Comparison of contribution amounts between paraties
ggplot(aes(x=log(contb_receipt_amt +1)), data=data) + geom_histogram(aes(fill=cand_party))

qplot(data=data, elect_delta, contb_receipt_amt) + ylim(0,quantile(data$contb_receipt_amt, .95))

ggplot(data, aes(yearmon, contb_receipt_amt)) + stat_summary(fun.y=mean, geom='line')  + scale_x_yearmon() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(data, aes(yearmon, estimated_salary)) + stat_summary(fun.y=mean, geom='line')  + scale_x_yearmon() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(data, aes(yearmon, population)) + stat_summary(fun.y=mean, geom='line')  + scale_x_yearmon() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Bivariate Analysis

It appears that contributions made to Republicans were higher than those made to Democrats; however, Democrats received a greater number of contributions. The gender breakdown shows that amongst donations made to Democrats, females made more contributions than males. Looking at Republican contributions, males made more contributions than females. This makes sense as females are more likely to identify with Democrats (http://www.people-press.org/2012/06/04/section-9-trends-in-party-affiliation/).

One of the most evident relationships was that contributions made to Republicans were higher than those made to Democrats while the opposite was true for the number of contributions. The most intriguing relationship, in my opinion, was between contributors who indicated their gender through the use of Mr, Ms, or Mrs in their name - roughly 70% of those who donated to Republicans indicated their gender while less than 1% of Democratic contributors indicated theirs. I have been unable to come up with an explanation for this - I thought that maybe Republicans included a place to indicate gender on their donation form/website, but I was unable to find any indication that this was true.

Multivariate Plots

ggplot(data,aes(elect_delta)) + geom_histogram(aes(fill=election_tp)) + facet_wrap(~ cand_party)

ggplot(data,aes(yearmon, contb_receipt_amt)) + geom_point(alpha = .5, position= position_jitter(),color="gray") + geom_line(stat = "summary", fun.y = mean,aes(color=cand_party),size=1.5) + ylim(0,quantile(data$contb_receipt_amt, 0.95)) + scale_x_yearmon()

data(zipcode)
data$zip <- clean.zipcodes(data$contbr_zip)

# zipcode package would change my zipcode in 'data' variable to zip codes in MA for some reason. Subsetting only Ohio values from zipcode seemed to solve this issue
oh <- subset(zipcode, state=='OH') 
map.data <- merge(subset(data, contb_receipt_amt > 0), oh, by.x='contbr_zip', by.y='zip')

map.plot <- ggplot(data=subset(map.data,cand_party == 'D' | cand_party == 'R'),aes(x=longitude, y=latitude, color=cand_party, size = contb_receipt_amt)) + geom_point(position = position_jitter(w=.08,h=.08)) + scale_size_continuous(breaks = c(50,100,500,1000,2500,5000,10000),range=c(1,15), name='Contribution Amount') + labs(x=NULL, y=NULL, color='Candidate Party') + scale_color_tableau()
map.plot <- map.plot + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.ticks=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank()) 
map.plot

Can we use this data to predict which party a contributor donated to?

Predicting to which party a contribution was made will require me to classify each contribution as either a Democrat or Republican (For simplicity, I’m going to drop any contributions made to third parties). As I will be tackling a binary classification problem, logistic regression is an appropriate model with which to start. Logistic regression uses a set of covariates to predict probabilities of (binary) class membership. We can then set a threshold to map these probabilities to class labels to solve the classification problem.

Logistic Regression in Detail

Logistic regression is an extension of the linear regression model with two important differences - the outcome variable and the error terms.

Outcome Variable

Key to any regression problem is the conditional mean of the outcome variable y given x. In linear regression, we assume that the conditional mean is a linear function taking values in \((-\infty, \infty)\): \[{E(y|x) =}\; \alpha + \beta{x}\] Unlike linear regression, the outcome variable of logistic regression has a conditional mean that takes values within [0,1]. In order to extend the linear regression model to logistic regression, we must map the outcome variable \(E(y|x)\) into [0,1] via a transformation called the logistic function. \[{E(y|x) =}\;\pi(x) = {e^\alpha+\beta{x}\over 1 + e^{\alpha+\beta{x}}}\] The logit function or log-odds function is a transformation of the logistic function. It can be useful in helping interpret the results. \[{g(x)=}\; \ln({\pi(x)\over {1-\pi(x)}}) = \alpha + \beta{x}\]

Error Terms

One of the key assumptions of linear regression is that the error terms follow independent Gaussian distributions with zero mean and constant variance. In logistic regression, the outcome variable can only be 0 or 1. Because of this, the error terms of logistic regression follow a Bernoulli distribution: \[\epsilon \sim \beta(0,\pi(1 - \pi))\]

Interpreting Results

In linear regression, the parameter \(\beta\) represents the change in the response variable for a unit change in the covariate. In logistic regression, the parameter \(\beta\) represents the change in the logit function for a unit change in the covariate. To interpret this change, we must define odds ratio.

The odds ratio of a binary event is given by the odds of the event divided by the odds of its complement: \[{OR =}\; {O(x=1)\over{O(x=0)}} = {\pi(i)/(1-\pi(1))\over{\pi(i)/(0-\pi(0))}}\]

Substituting the definition of \(\pi(x)\) into this equation yields \[{OR = }\; \epsilon^\beta\] The relationship between the odds ratio and \(\beta\) is what makes logistic regression such a powerful tool.

# prepare the data set
glm.data <- data[c(11, 19, 20, 21, 22, 26, 28, 29, 30, 31, 32, 33)]

# this is America, third parties get ignored
glm.data <- subset(glm.data, cand_party == 'D' | cand_party == 'R') 
glm.data <- na.omit(glm.data)

# set D = 1, R = 0
glm.data$cand_party <- ifelse(glm.data$cand_party == 'D', 1, 0)

# set male = 1, female = 0
glm.data$predicted_gender <- ifelse(glm.data$predicted_gender == 'male',1,0)

# set general election = 1, primary election = 0
glm.data$election_tp <- ifelse(glm.data$election_tp == 'G2012',1,0)

# create a sample size to be used for testing/training sets
smp_size <- floor(0.75 * nrow(glm.data))

## set the seed to make your partition reproductible
set.seed(1)

# training index
train_ind <- sample(seq_len(nrow(glm.data)), size = smp_size)
train <- glm.data[train_ind, ]
test <- glm.data[-train_ind, ]

# fit model
fit <- glm(cand_party ~ log(contb_receipt_amt + 1) + log(estimated_salary+1) + predicted_gender + cbus + cincy + cleveland   + log(elect_delta + 1) + election_tp +  log(population+1) + multiple_contb + included_gender, data=train, family="binomial")

# predict results on our test set
predpr <- predict(fit, newdata=test,type="response")
test$predicted <- predpr

# plot coefficients
coefplot(fit,intercept=FALSE)

Coefficient Plot

Instead of simply looking at a print out of the coefficients, I find it much easier to understand my model with a visualization. The y-axis of this plot containes all of the covariates in my model and the x-axis represents coefficients. The distance a point lies from zero is indicative of its influence in the model - the closer to zero, the less influential. Negative coefficients are associated with Republicans and positive coefficients with Democrats - this is true simply because I encoded Republicans as 0 and Democrats as 1. A negative coefficient will therefore “pull” down towards 0 and a postive coefficient will “push” up towards 1.

The most striking thing about this plot is the included_gender coefficient - it is very far away from zero on the negative side. Thinking about it, this makes sense because so many Republican contributors included their gender while nearly none of Democratic contributors included theirs. The size of this coefficient increases my suspision that there is something dubious about my included_gender variable. However, because I was unable to identify any explanations as to why Republican contributors included this information and this is a fun side project, I’ll ignore my reservations about excluding this variable in my model. However, if I were in this situation in a legitimate work or research setting, I would not include this variable without extensive addtional research.

Evaluating Performance

There are different ways to evaluate the performance a logistic regression model. I chose to use two techniques, the receiver operating characterstic (ROC) curve and cross-validation.

ROC Curve

A ROC curve is a graphical illustration of the performance of a binary classification system that plots the True Positive Rate against False Positive Rate. - True Positive Rate \(P(\hat{Y}_i=1 | Y_i=1)\) the proportion of 1s (Democrats, in our case) that are correctly identified - False Negative Rate \(P(\hat{Y}_i=1 | Y_i=0)\) the proportion of 1s (Democrats) that are incorrectly identified

In our ROC curve, we will see the proportion correctly identified as Democrats plotted against the proportion incorrectly identified as Democrats. Plotting the true positive rate against the false positive rate will show the ROC curve and by calculating the area under this curve (AUC), we are able to evaluate the performance of our model. A higher AUC indicates greater predictive power: - auc of 0.5 = no predictive power - auc of 1.0 = perfect predictive power

glm.roc <- roc(test$cand_party ~ predpr)
auc <- round(glm.roc$auc,2)
roc.df <- data.frame(specificities=glm.roc$specificities,sensitivities = glm.roc$sensitivities)

roc.plot <- ggplot(roc.df,aes(1-specificities, sensitivities)) + geom_line(aes(), size=0.5) + geom_abline(intercept = 0) + xlab('False Positive Rate (1 - Specificity)') + ylab('True Positive Rate (Sensitivity)') + annotate("text",x=.50,y=0.80,label=paste('The Area Under the Curve is',auc))
roc.plot

As previously mentioned, an AUC of 1 would indicate that our model has perfect predictive power. Our AUC is 0.89 - this is indicates that our model is performing very well. Next, we can perform cross-validation so that we can evaluate our model’s performance from a different perspective.

Cross-validation

The second method I used to evaluate my model is cross-validation, specifically a K-Fold Cross-Validation. In my opinion, the results from cross-validation provide a more intuitive way to evaluate our model because it returns a simple metric - the percent of predictions that were incorrectly classified.

Cross-validation involves the following steps: - Randomly split the dataset into k equal partitions. - Use partition 1 as the test set & the union of the other partitions as the training set. - Find generalization error. - Repeat steps 2-3 using a different partition as the test set at each iteration. - Take the average generalization error as the estimate of out-of-sample accuracy.

cv.fit <- cv.glm(train, fit, K=10)
cv.error <- cv.fit$delta
paste("The estimate of the out-of-sample error is",cv.error)
## [1] "The estimate of the out-of-sample error is 0.0961957904646189"
## [2] "The estimate of the out-of-sample error is 0.0961935159691379"

Our estimated out-of-sample error is 9% - this means that we are correctly classifying 91% of predictions!

Final Plots and Summary

Map

map.plot + ggtitle("Contribution Amount by Party and Zip Code")

The map makes it easy to spot the most populated cities in Ohio. The center mass is Columbus and to its left, you can see a cluster which contains Dayton at the top and Cincinatti at the bottom. The top right cluster is the Cleveland area and the top left is Toledo. The size of each point indicates the amount of the contribution. Contributions to Republicans appear to be larger than contributions to Democrats. However, there are more contributions made to Democrats than Republicans.

Contribution Amount Over Time

ggplot(data,aes(yearmon, contb_receipt_amt)) + geom_point(alpha = .5, position= position_jitter(),color="gray") + geom_line(stat = "summary", fun.y = mean,aes(color=cand_party),size=1.5) + ylim(0,quantile(data$contb_receipt_amt, 0.95)) + scale_x_yearmon() + labs(x='Month and Year', y='Mean of Contribution Amount', color="Candidate's Party") + ggtitle("Contribution Amount Over Time")

This plot depicts the mean contribution made to each party over time. As you can see, contributions made to Republicans were consistently higher than contributions made to Democrats. Further, the mean contribution amount for both parties decreased as it got closer to the election.

Number of Contributions Over Time

ggplot(data,aes(elect_delta)) + geom_histogram(aes(fill=cand_party)) + labs(x='Days From Election', y='Count', fill="Candidate's Party") + ggtitle("Number of Contributions Over Time")

The histogram reveals that as it got closer to election time, both Democrats and Republicans received a greater volume of contributions. This makes sense as Americans are typically apathetic towards politics, especially when an election is a year or more away. However, as the election got closer, compaigns would have ramped up their efforts to solicit contributions and people likely started paying more attention to the election.


Reflection